Infectious Diseases in Germany by Stefan Borchardt

Iteration 1

1.1 Plots

I use data on infectious diseases that is collected by the Robert Koch Institute in Germany as an official statistic. From the various customizable selections available, and with the help of a doctor, I chose the incidence of 14 diseases which can cause stomach flu or diarrhea. My idea is that a patient’s characteristics can help to point out the most likely causes of similar symptoms.

Because I selected which data to include through the interface at https://survstat.rki.de , I already have some understanding of the structure of the data. On the other hand, because I did the data wrangling myself, I have to check that joins and value transformations worked as intended.

First, I display some textual summaries:

## 'data.frame':    12416 obs. of  16 variables:
##  $ date         : Date, format: "2001-01-01" "2001-01-08" ...
##  $ age          : Factor w/ 16 levels "A00..00","A01..01",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ campylobacter: num  0.29 0.44 0.44 1.6 0.44 0.58 0.29 0.15 0.73 0.44 ...
##  $ ecoli        : num  0.44 0.73 2.03 3.34 3.2 1.74 2.91 1.74 2.32 2.76 ...
##  $ ehec         : num  0 0.44 0.29 0 0.15 0 0.15 0 0.15 0.29 ...
##  $ giardia      : num  0 0 0 0 0.15 0 0 0.29 0 0.44 ...
##  $ hus          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ influenza    : num  0 0 0 0.15 0 0.15 0 0.29 0.58 0.15 ...
##  $ legionella   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ meningokokken: num  0 0.44 0.15 0 0 0 0.44 0.29 0.15 0.15 ...
##  $ norovirus    : num  0 0 0 0.15 0.29 0.15 0.29 0.44 0.44 1.02 ...
##  $ rotavirus    : num  9.59 20.92 36.18 44.17 44.61 ...
##  $ salmonella   : num  1.45 1.89 2.03 2.32 1.6 1.31 1.02 1.31 2.03 1.45 ...
##  $ shigella     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ typhus       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ yersiniosis  : num  0.15 0.15 0.15 0 0 0.44 0 0 0 0.15 ...
##       date                 age       campylobacter       ecoli        
##  Min.   :2001-01-01   A00..00: 776   Min.   :0.000   Min.   : 0.0000  
##  1st Qu.:2004-09-21   A01..01: 776   1st Qu.:0.940   1st Qu.: 0.0300  
##  Median :2008-06-06   A02..02: 776   Median :1.430   Median : 0.0700  
##  Mean   :2008-06-05   A03..03: 776   Mean   :1.681   Mean   : 0.7247  
##  3rd Qu.:2012-02-20   A04..04: 776   3rd Qu.:2.150   3rd Qu.: 0.4200  
##  Max.   :2015-11-12   A05..09: 776   Max.   :8.740   Max.   :14.3400  
##                       (Other):7760                                    
##       ehec           giardia           hus             influenza      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000000   Min.   :  0.000  
##  1st Qu.:0.0000   1st Qu.:0.040   1st Qu.:0.000000   1st Qu.:  0.000  
##  Median :0.0200   Median :0.090   Median :0.000000   Median :  0.000  
##  Mean   :0.1024   Mean   :0.116   Mean   :0.009513   Mean   :  1.366  
##  3rd Qu.:0.1100   3rd Qu.:0.140   3rd Qu.:0.000000   3rd Qu.:  0.260  
##  Max.   :2.2900   Max.   :1.570   Max.   :0.900000   Max.   :222.650  
##                                                                       
##    legionella      meningokokken       norovirus         rotavirus      
##  Min.   :0.00000   Min.   :0.00000   Min.   : 0.0000   Min.   :  0.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.3775   1st Qu.:  0.180  
##  Median :0.00000   Median :0.00000   Median : 1.0100   Median :  0.510  
##  Mean   :0.00852   Mean   :0.03674   Mean   : 3.6344   Mean   :  6.128  
##  3rd Qu.:0.01000   3rd Qu.:0.02000   3rd Qu.: 2.9500   3rd Qu.:  3.330  
##  Max.   :0.35000   Max.   :1.31000   Max.   :78.1300   Max.   :184.350  
##                                                                         
##    salmonella        shigella          typhus          yersiniosis   
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.000000   Min.   :0.000  
##  1st Qu.: 0.500   1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.030  
##  Median : 1.000   Median :0.0000   Median :0.000000   Median :0.090  
##  Mean   : 2.052   Mean   :0.0208   Mean   :0.001782   Mean   :0.323  
##  3rd Qu.: 2.280   3rd Qu.:0.0200   3rd Qu.:0.000000   3rd Qu.:0.350  
##  Max.   :23.370   Max.   :1.0700   Max.   :0.150000   Max.   :4.300  
## 
##  [1] "A00..00" "A01..01" "A02..02" "A03..03" "A04..04" "A05..09" "A10..14"
##  [8] "A15..19" "A20..24" "A25..29" "A30..39" "A40..49" "A50..59" "A60..69"
## [15] "A70..79" "A80."

I have data from 2001 until recently. The age factor is not evenly spaced: younger ages have finer granularity. Only three diseases (campylobacter, norovirus, and salmonella) have a median incidence of about 1 per 100,000 or more. An extreme example is influenza, with a maximum of 222.65 and a median of 0.

To see if my data wrangling is plausible I use the library hts to plot a grouped time series. The next plots show the incidences over time grouped by age and disease.

Some age groups seem to be more prone to these infectious diseases and there are seasonal patterns. There are long-term changes and only three diseases reach high incidences.

This looks quite promising for a plausibility check. I will adapt my selection of data for further exploration.

1.2 Analysis

What is the structure of your dataset?

The incidence of the diseases, as cases per 100,000, is reported by week over the last 15 years and by age group of the patients. After combining 14 separate files, the dataset contains 12,416 observations of 16 variables: the date, the age group, and the incidence of each of the 14 diseases. I filled in zeros where necessary, so that all age groups and diseases are present in every week, which helps with time series analysis at later stages. The patient’s age is grouped by single year for children under 5, in five-year groups from 5 to 29 years, and in 10-year groups from 30 to 79 years. People aged 80 and over form the last group.

What is/are the main feature(s) of interest in your dataset?

I’d like to know, when a patient sees a doctor with gastrointestinal symptoms, what are the most likely infectious diseases to check. From a first look, a couple of the diseases have a maximum incidence lower than 1 per 100,000, but norovirus, salmonella and campylobacter have medians above 1 per 100,000.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I assume the age of the patient and the time of the year to be other important factors. For a first glimpse, I plotted the dataset as a time series using hts, which revealed a seasonal pattern for some diseases and differences between the age groups.

Did you create any new variables from existing variables in the dataset?

No, existing values were transformed only.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The interface I had to extract the data from, SurvStat@RKI 2.0 https://survstat.rki.de , allows downloading only two dimensions at a time, for which I chose patient age group and week. The 14 data files typically had 16 columns and 731 rows each. I combined these files into one by joining the separate data frames after I had changed the data format of some columns. Finally, I melted the combined data into one set of 12,416 observations of 16 variables.
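A minimal sketch of the joining step, together with the zero filling mentioned earlier, using base R only. The two small data frames stand in for the downloaded files; their contents and all object names are hypothetical, and the wrangling code actually used for this report is not shown here.

```r
# Hypothetical stand-ins for two per-disease tables in long form:
# one row per week and age group, with gaps where nothing was reported.
noro <- data.frame(
  date = as.Date(c("2001-01-01", "2001-01-08")),
  age = c("A00..00", "A01..01"),
  norovirus = c(0.15, 0.29)
)
rota <- data.frame(
  date = as.Date(c("2001-01-01", "2001-01-01", "2001-01-08")),
  age = c("A00..00", "A01..01", "A00..00"),
  rotavirus = c(9.59, 4.89, 20.92)
)

# Full grid of weeks and age groups, so that no combination is missing.
grid <- expand.grid(
  date = sort(unique(c(noro$date, rota$date))),
  age = sort(unique(c(noro$age, rota$age))),
  stringsAsFactors = FALSE
)

# Left-join every disease table onto the grid, one after another.
combined <- Reduce(
  function(x, y) merge(x, y, by = c("date", "age"), all.x = TRUE),
  list(grid, noro, rota)
)

# Combinations without a reported value become an incidence of zero.
disease_cols <- setdiff(names(combined), c("date", "age"))
combined[disease_cols][is.na(combined[disease_cols])] <- 0
```

Joining onto a complete grid, rather than joining the files directly to each other, guarantees that every week and age group appears exactly once, which keeps the time series continuous.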


Iteration 2

2.1 Plots

The first look at the data was so promising that I decided to increase the number of variables I take into consideration. Unfortunately, the limitations of the interface required downloading a total of 72 files to additionally include gender and region of the patients. I removed the five rarest diseases of iteration 1, because I think a maximum incidence below 1.5 in 100,000 is not helpful for answering the question of which disease a patient is likely to have.

Again, I check the structure and get an overview:

## 'data.frame':    99456 obs. of  14 variables:
##  $ date  : Date, format: "2001-01-01" "2001-01-01" ...
##  $ age   : Factor w/ 16 levels "A00","A01","A02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ camp  : num  0 0 0 4.75 0 0 0 0.51 0.48 1.64 ...
##  $ week  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ ecol  : num  1.66 0 0 0 0 0 0 0 0 0 ...
##  $ ehec  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ giar  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ infl  : num  0 0 0 0 0 0 0.27 0 0 0 ...
##  $ noro  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rota  : num  8.28 4.89 1.61 1.58 0 0.6 0.27 0.26 0.24 0 ...
##  $ salm  : num  0 0 0 3.17 1.56 1.49 0 0.26 0 0.47 ...
##  $ yers  : num  0 0 0 0 0 0.3 0.55 0 0 0 ...
##  $ gender: Factor w/ 2 levels "fem","mal": 1 1 1 1 1 1 1 1 1 1 ...
##  $ region: Factor w/ 4 levels "e","n","s","w": 2 2 2 2 2 2 2 2 2 2 ...
##       date                 age             camp            week      
##  Min.   :2001-01-01   A00    : 6216   Min.   : 0.00   Min.   : 1.00  
##  1st Qu.:2004-09-23   A01    : 6216   1st Qu.: 0.76   1st Qu.:13.00  
##  Median :2008-06-10   A02    : 6216   Median : 1.41   Median :26.00  
##  Mean   :2008-06-09   A03    : 6216   Mean   : 1.79   Mean   :26.42  
##  3rd Qu.:2012-02-26   A04    : 6216   3rd Qu.: 2.29   3rd Qu.:39.00  
##  Max.   :2015-11-19   A05    : 6216   Max.   :24.81   Max.   :53.00  
##                       (Other):62160                                  
##       ecol              ehec              giar             infl        
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.0000   Min.   :  0.000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:  0.000  
##  Median : 0.0000   Median : 0.0000   Median :0.0000   Median :  0.000  
##  Mean   : 0.8432   Mean   : 0.1091   Mean   :0.1218   Mean   :  1.539  
##  3rd Qu.: 0.2600   3rd Qu.: 0.0000   3rd Qu.:0.1200   3rd Qu.:  0.120  
##  Max.   :50.4400   Max.   :10.4400   Max.   :8.2700   Max.   :319.190  
##                                                                        
##       noro             rota              salm             yers        
##  Min.   :  0.00   Min.   :  0.000   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.:  0.21   1st Qu.:  0.120   1st Qu.: 0.400   1st Qu.: 0.0000  
##  Median :  0.93   Median :  0.480   Median : 0.930   Median : 0.0000  
##  Mean   :  4.16   Mean   :  6.902   Mean   : 2.193   Mean   : 0.3944  
##  3rd Qu.:  3.10   3rd Qu.:  3.170   3rd Qu.: 2.290   3rd Qu.: 0.2000  
##  Max.   :284.53   Max.   :466.500   Max.   :56.750   Max.   :24.6100  
##                                                                       
##  gender      region   
##  fem:49728   e:24864  
##  mal:49728   n:24864  
##              s:24864  
##              w:24864  
##                       
##                       
## 

This time, in addition to deriving a date from the week of the year in the original data, I keep the week number to facilitate the investigation of seasonality. Only nine diseases are included, but the incidences for regions of Germany and gender of the patient have been added.
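Deriving a date from a year and a week number can be sketched in base R as follows. This assumes ISO 8601 week numbering, which the report does not state explicitly; the function name `week_to_date` is made up for this illustration.

```r
# Monday of a given ISO 8601 week. ISO week 1 is the week containing 4 January.
# Assumption: the source data uses ISO week numbers.
week_to_date <- function(year, week) {
  jan4 <- as.Date(sprintf("%d-01-04", as.integer(year)))
  # "%u" gives the weekday as a number, Monday = 1 ... Sunday = 7
  monday_week1 <- jan4 - (as.integer(format(jan4, "%u")) - 1)
  monday_week1 + 7 * (as.integer(week) - 1)
}

weeks <- data.frame(year = c(2001, 2001, 2015), week = c(1, 2, 53))
weeks$date <- week_to_date(weeks$year, weeks$week)
```

For weeks 1 and 2 of 2001 this yields 2001-01-01 and 2001-01-08, the first two dates in the structure output of iteration 1. Keeping `week` as its own column avoids recomputing it from the date whenever seasonality is examined.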

I notice that the maximum incidences are higher now. Because I now break the data down by region and gender, I see more extreme values that were averaged out before. The means are roughly the same; I think the remaining difference is caused by rounding in the original data, which sets everything below 0.1 to 0.

Also, I plot a time series to see that the data wrangling worked. In addition to incidences over time grouped by age and disease, I will also plot grouped by region and gender:

The first plots are about the same as before. Adding gender is a bit disappointing, but there are differences between the regions.

The plots allow checking for outliers and plausibility, but are not intended for thorough analysis. The spike at the end of 2009 seems to have hit both genders almost equally, and all regions to varying degrees, but only a narrow age group, so it is a legitimate outlier. Overall, the lines rise and fall continuously, so I can assume that there are no single erroneous data points. No line is so different from the others that it could be a sum.

From the structure I can derive that histograms for all variables besides the disease incidences will be evenly distributed, because incidences of zero were filled in to obtain a continuous time series.

If I temporarily remove these filled values I might get some insights:

At first I was surprised that the occurrence of diseases seemed evenly distributed over all ages; then I recoded the factor levels:

Kids are more prone to infectious diseases it seems.

Finally, I have a look at the distribution of dates:

There are sometimes no or twice the counts at the change of the year. This might be caused by the 2-week holidays at the end of the year or by errors in date conversion.

Next, I have a look into the distribution of incidences, also experimenting on equidistant breaks on log scale:

Most incidences (per week, region, age, gender and disease) are around 1.5 cases per 100,000 people, it seems.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.040   0.280   0.940   3.532   2.410 466.500
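Breaks that are equidistant on a log scale can be built as in the following sketch. The range is taken from the summary above; the number of bins and the synthetic log-normal sample are assumptions, since the plotting code is not part of this report.

```r
# Range of the non-zero incidences, taken from the summary above.
lower <- 0.04
upper <- 466.5

# 21 break points, equally spaced on a log10 scale (20 bins).
breaks <- 10^seq(log10(lower), log10(upper), length.out = 21)

# Synthetic values stand in for the real incidences; they are clamped
# to the outermost breaks so that every value falls into a bin.
set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 1.5)
x <- pmin(pmax(x, breaks[1]), breaks[length(breaks)])

# Count values per bin without drawing the histogram.
h <- hist(x, breaks = breaks, plot = FALSE)
```

On a linear axis these bins grow wider towards the right; on a log-scaled axis they all have the same width, which keeps the many small incidences from collapsing into a single bar.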

It is hard to compare regions by the number of disease occurrences, so how are the incidences distributed between the regions?

High incidences seem to occur more often in the East.

I remember seasonal patterns from the initial time series plot. The next plots show the count of incidences above a threshold over the week of the year:

Because the data is current (updated every Wednesday), I have to consider that values for December are still lower at this time of the year. Salmonella and campylobacter have a moderate peak in summer, but much higher incidences can be seen for noro- and rotavirus in winter and spring.

Are there long-term trends? I resample the time series as quarterly and yearly values for the disease incidences:

From the plots it seems as if norovirus is on the rise while rotavirus infections are declining. Immunization shots against rotavirus have been available since 2006 and became officially recommended (and paid for) in 2013. I will have a look into the long-term trends in more detail later.
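Resampling a weekly series to quarterly and yearly values can be sketched with base R as below. The weekly values are synthetic placeholders, and summing over time is an assumption about the aggregation; it is valid for case counts, and for incidences as long as they refer to the same population.

```r
# Two years of synthetic weekly values (placeholder data).
weekly <- data.frame(
  date = seq(as.Date("2001-01-01"), by = "week", length.out = 104),
  value = rep(1, 104)
)

# Derive the grouping variables from the date.
weekly$year <- as.integer(format(weekly$date, "%Y"))
weekly$quarter <- quarters(weekly$date)

# Sum the weekly values within each quarter and within each year.
quarterly <- aggregate(value ~ year + quarter, data = weekly, FUN = sum)
yearly <- aggregate(value ~ year, data = weekly, FUN = sum)
```

Aggregating over time is a different operation from aggregating over regions or genders, where the values refer to different populations.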

I have to consider that these are the cases that were reported to the Robert Koch Institute. The inclination to report the incidents might be influenced by, for instance, the amount of paperwork required (2 pages) or the awareness or diagnostic capabilities of the doctors.

I looked for additional data sources and found data on the incidence of gastroenteritis from a health insurer. It contains the cases diagnosed as gastroenteritis which caused sick leave in the years 2002-2008.

With the help of a doctor I found out that at least 80% of the cases have an infectious cause, which means I should see a yearly incidence of well above 3,000 in the RKI’s data. The plotted values are much lower. I checked this against the yearly values through the institute’s interface and noticed even lower values (240-560) there.

I think the reason for the different values is that some incidences were summed when a mean should have been calculated. For instance, an incidence of 4 in Region East and 6 in Region North does not mean a combined incidence of 10, but of 5 if both regions have the same population (and a population-weighted mean otherwise). On the other hand, incidences for diseases and ages should be added.
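The difference between summing and averaging incidences can be checked with a few lines of R. The population figures below are made up for the illustration; only the incidences 4 and 6 come from the example above.

```r
# Incidence per 100,000 and (hypothetical) population for two regions.
regions <- data.frame(
  region = c("e", "n"),
  incidence = c(4, 6),
  population = c(1e6, 1e6)
)

# Adding the incidences ignores that the denominators add up as well.
wrong <- sum(regions$incidence)

# Recover the case counts, then divide by the combined population.
cases <- regions$incidence * regions$population / 1e5
combined <- sum(cases) / sum(regions$population) * 1e5

# Equivalent shortcut: a population-weighted mean of the incidences.
weighted <- weighted.mean(regions$incidence, w = regions$population)
```

With equal populations the weighted mean is simply 5. With unequal populations the result moves towards the incidence of the larger region, which is why separate case and population counts are the safer basis.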

I’ll start over with separate values for cases and population.

2.2 Analysis

What is the structure of your dataset?

The dataset contains 99,456 observations of 14 variables. For 777 weeks I have the incidence (as cases per 100,000) of nine not very rare infectious diseases, which are related to gastrointestinal symptoms, for patients grouped by age, gender and region of Germany. Again, the observations have been filled with zeros to help with time series analysis later on. A redundant column for the week of the year has been included. The regions are North, East, South and West, all of which contain bigger cities and rural areas. Data with unclear gender was already omitted when downloading the data. The size of the dataset is approximately 6 MB.

What is/are the main feature(s) of interest in your dataset?

I’d like to know, when a patient sees a doctor with gastrointestinal symptoms, what are the most likely infectious diseases to check. From a first look, gender does not seem to be very important, but age seems to make a difference.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I assume the region of the patient and the time of the year to be other important factors. My time series plots for validation hinted at these relationships.

Did you create any new variables from existing variables in the dataset?

I temporarily created a variable for age with factor levels of equal width.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

As before, I had to combine several files into one. This time though, I shortened the names of factor levels to a fixed length to make use of the automatic grouping of the hts library.


Iteration 3

3.1 Plots

This time I am going to calculate the incidence myself, because the data source did not clearly document which numbers the incidence was calculated from, and nobody there was available for questions.

Instead of the incidence there is now a case count. To calculate the incidence, a population year is included that matches each observation with the population data. The data now contains age as a number instead of age groups. The date is supplemented by the week of the year.

## 'data.frame':    2283228 obs. of  7 variables:
##  $ date      : POSIXct, format: "2001-01-01" "2001-01-01" ...
##  $ disease   : Factor w/ 9 levels "camp","ecol",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ week      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pop_year  : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ age       : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ case_count: int  1 2 1 5 0 1 0 0 0 0 ...
##  $ region    : Factor w/ 4 levels "e","n","s","w": 2 2 2 2 2 2 2 2 2 2 ...
##       date                        disease            week     
##  Min.   :2001-01-01 00:00:00   camp   :253692   Min.   : 1.0  
##  1st Qu.:2004-09-30 00:00:00   ecol   :253692   1st Qu.:14.0  
##  Median :2008-07-01 00:00:00   ehec   :253692   Median :27.0  
##  Mean   :2008-06-30 11:25:58   giar   :253692   Mean   :26.6  
##  3rd Qu.:2012-04-01 00:00:00   infl   :253692   3rd Qu.:40.0  
##  Max.   :2015-12-31 00:00:00   noro   :253692   Max.   :53.0  
##                                (Other):761076                 
##     pop_year         age       case_count     region    
##  Min.   :2000   Min.   : 0   Min.   :  0.00   e:570807  
##  1st Qu.:2004   1st Qu.:20   1st Qu.:  0.00   n:570807  
##  Median :2008   Median :40   Median :  0.00   s:570807  
##  Mean   :2007   Mean   :40   Mean   :  1.78   w:570807  
##  3rd Qu.:2012   3rd Qu.:60   3rd Qu.:  1.00             
##  Max.   :2014   Max.   :80   Max.   :823.00             
## 

First, I try to replicate the last plot:

The numbers now match the values I got from the data source. In a perfect world, the green and black lines would be much higher than the blue line.

3.1.1 Disease Incidences vs. Time

This also means that I have to check the values of other plots I consider for my final plots. Here is the update for the long-term trends of diseases:

Let’s see how a linear model performs for some diseases.

Numerical values for norovirus:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.9160 -1.0160 -0.4707  0.0000  0.4400  9.2860

This is rather disappointing; in particular, the seasonality stays unexplained. I will try a seasonal decomposition based on loess smoothing (STL), using norovirus as the example:

##  Call:
##  stl(x = noro_ts, s.window = "periodic")
## 
##  Time.series components:
##     seasonal              trend             remainder        
##  Min.   :-1.2406998   Min.   :-0.843619   Min.   :-2.062319  
##  1st Qu.:-1.0785987   1st Qu.: 0.487411   1st Qu.:-0.715884  
##  Median :-0.3043678   Median : 1.861842   Median :-0.062093  
##  Mean   : 0.0030403   Mean   : 1.641958   Mean   : 0.011936  
##  3rd Qu.: 0.9728293   3rd Qu.: 2.567985   3rd Qu.: 0.590311  
##  Max.   : 2.0875549   Max.   : 3.681525   Max.   : 5.452671  
##  IQR:
##      STL.seasonal STL.trend STL.remainder data 
##      2.051        2.081     1.306         1.925
##    % 106.6        108.1      67.9         100.0
## 
##  Weights: all == 1
## 
##  Other components: List of 5
##  $ win  : Named num [1:3] 7831 79 53
##  $ deg  : Named int [1:3] 0 1 1
##  $ jump : Named num [1:3] 784 8 6
##  $ inner: int 2
##  $ outer: int 0
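A self-contained sketch of the same `stl` call on synthetic data, for readers who want to reproduce the structure of this output. The simulated series, its trend and amplitude, and the frequency of 53 are assumptions; only the call `stl(x = noro_ts, s.window = "periodic")` is taken from the output above.

```r
# Synthetic weekly series with a trend, a yearly cycle, and noise.
# It stands in for the real norovirus incidences.
set.seed(42)
n_weeks <- 783
week_index <- seq_len(n_weeks)
x <- 1.5 + 0.002 * week_index +
  1.2 * cos(2 * pi * week_index / 53) +
  rnorm(n_weeks, sd = 0.5)
noro_ts <- ts(x, start = c(2001, 1), frequency = 53)

# "periodic" forces the seasonal component to repeat identically every cycle.
fit <- stl(noro_ts, s.window = "periodic")

# Matrix with the columns seasonal, trend, and remainder.
components <- fit$time.series
summary(components[, "remainder"])
```

The three components add up to the original series, so whatever the seasonal and trend parts do not capture ends up in the remainder. A fixed periodic pattern cannot follow seasons that vary in strength from year to year, which is one possible reason for seasonality left in the remainder.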

The remainder still shows a seasonality, but the values are much better than without seasonality. I’d like to try out some forecasting models to obtain more insights into seasonality and trend. I start with the package hts and use a TBATS model from the package forecast. This model uses spectral analysis and exponential smoothing to fit the time series.

This model clearly incorporated the seasonality. Here are the numeric values for the residuals of norovirus:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.89400 -0.11450  0.01817  0.01015  0.13750  2.49300

Next is package season, with which I fit a non-stationary cosinor model (sinusoid smoothing) to norovirus only.

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -2.3000000 -0.0709400  0.0075020  0.0009404  0.0929200  2.4730000

Compared to hts/TBATS, the residuals are higher.

The frequently used library forecast has been mentioned before. When used directly, it allows access to more details:

This means that it was determined
  • not to transform the data using the Box-Cox function,
  • to use an ARMA(3, 4) process for error modeling, which means
    • autoregression of order 3 and
    • moving average of order 4,
  • to model the seasonality of 53 weeks with 5 Fourier terms.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.89400 -0.11450  0.01817  0.01015  0.13750  2.49300

The numerical summary shows the same values as above.
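The idea of modeling a 53-week seasonality with 5 Fourier terms can be illustrated without the forecast package. The sketch below is not TBATS: it is a plain harmonic regression with `lm`, without Box-Cox transformation, trend smoothing, or ARMA errors, fitted to synthetic data. The helper `fourier_terms` is written for this example only.

```r
# Synthetic weekly series with one yearly cycle plus noise (placeholder data).
set.seed(7)
period <- 53
n <- 6 * period
week_index <- seq_len(n)
y <- 2 + 1.5 * cos(2 * pi * week_index / period) + rnorm(n, sd = 0.3)

# K pairs of sine and cosine columns with periods 53, 53/2, ..., 53/K.
fourier_terms <- function(index, period, K) {
  terms <- lapply(seq_len(K), function(k) {
    cbind(sin(2 * pi * k * index / period), cos(2 * pi * k * index / period))
  })
  out <- do.call(cbind, terms)
  colnames(out) <- paste0(rep(c("sin", "cos"), K), rep(seq_len(K), each = 2))
  out
}

# Regress the series on the 5 harmonics and inspect the residuals.
X <- fourier_terms(week_index, period, K = 5)
fit <- lm(y ~ X)
summary(residuals(fit))

# Baseline without seasonal terms, for comparison.
fit_flat <- lm(y ~ 1)
```

More Fourier terms allow sharper seasonal peaks at the cost of more parameters. In this sketch the number of terms is fixed by hand.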

The models identify a seasonality, but the forecast confidence interval is rather large. I will stick with a simple plot:

This is going to be a final plot, because it makes it easy to see the highest incidences for a given time of the year.

3.1.2 Disease Incidence vs. Age

I revisit the disease distribution by age, only this time incidences are not only counted but summed up.

As a bar plot with color indicating the age distribution:

And finally as dot area size representing incidence, original and scaled:

I think the difference between campylobacter, norovirus and salmonella on the one hand and rotavirus on the other is much clearer in the first of these plots. This could be the basis for a final plot.

I had not expected the differences to be that big. Again, I have to consider that the data contains reported cases; maybe adults just do not see a doctor when they have the symptoms.

All plots show that only 3 to 5 diseases seem to be responsible for the majority of cases, which is also something I wanted to investigate in an earlier section.

3.1.3 Disease Incidences vs. Time Revisited

Here is the variability of the monthly incidences of each disease in Germany for all ages.

Influenza has a low mean but quite a few outliers, up to 109. Together with the long-term plots from above I conclude that influenza has a rather short season, while rota- and norovirus stay for a longer time of the year.

3.1.4 Disease Incidences vs. Region

Next, I’m going to explore how region and disease are related.

The regions are aligned to a windrose for a more intuitive understanding. Because the values are proportionate to the petal radius, differences seem exaggerated.

Region East shows higher incidences in both plots; maybe this is caused by a one-time event.

The upper plots show incidences in the east and the rest, the lower plot the difference between them. Region East has higher incidences for all diseases in almost every single month. Are doctors there more eager to report or is this caused by the age structure?

The size of the dots shows what proportion the different ages have within the east and the rest, while color indicates how big the difference between the regions is. The East has relatively fewer people around 20, and more people 60+. The difference in small kids is relatively low, though, and I should not forget that I am comparing incidences as cases per 100,000 already.

3.1.5 Disease Incidences vs. Age and Region

The plot shows the relative difference of incidences (in percent, size of dots) for the diseases and various ages. There is one dot in the color of the region with the highest deviation from the mean for every year from 2001 to 2015.

For rotavirus, there is a rather big deviation for all ages and years in region East. For almost all diseases, region East deviates most for people under 18. Historically, healthcare was much more centralised in the East, which might still influence the tendency of doctors to report cases. There are more patterns in the plot for which I do not have a potential explanation.

3.1.6 Seasonality of Disease Incidences vs. Time

Now that the exploration of age, region, and disease is exhausted, I will have a look at the seasonality again.

The seasonality, which showed in the plot of the annual means above, is still there, but there are some changes over time. When I forecast noro- or rotavirus later, I will limit the data to the years from 2008 onward:

Norovirus has high incidences in small kids and older people. Maybe the seasonality varies with age for norovirus.

3.1.7 Selected Incidence Seasonality vs. Age and Time

What is the pattern for campylobacter?

While the seasonal pattern is similar within the age groups (columns), it looks different between the groups for norovirus. I will consider that in the final version of the seasonality plot.

3.1.8 Selected Incidence Seasonality vs. Age and Region

Before I try a forecast, I have a look at the seasonality of some diseases by ages and regions:

There is no real difference between the regions. The diseases seem to hit all ages at once within the season.

3.1.9 Forecasting Revisited

With the new insights I try to forecast some incidences again. I limit the data to 2008 onward, and predict separately for three age groups (young, middle-aged, and older).

The thick black and the thin red line are the actual incidence of a disease and age group. The dashed line is the forecast with a 50% confidence interval. The thin black line shows the residuals of the model.

I would not have expected the forecasts to be that good.

I think I have enough insight now to prepare the final plots.

3.2 Analysis

What is the structure of your dataset?

The dataset consists of two parts. To compute the incidences on my own, I downloaded the population data of the German states from https://www-genesis.destatis.de/genesis/online/link/tabellen/12411-0011 . There are limitations on the size of data you can obtain for free, so it is only biennial. Initially, this first part of the data contained 736 obs. of 18 variables, but I reduced it to 648 obs. of 6 variables by summing values for regions and ages 80+.

The second part contains 2,283,228 observations of 7 variables. For 783 weeks I have the case counts of nine not very rare infectious diseases, which are related to gastrointestinal symptoms, for patients grouped by age and region of Germany. Two redundant columns have been included: the week of the year, and the year used to match with the population data. The size of this dataset is approximately 80 MB.

Did you create any new variables from existing variables in the dataset?

The incidence was calculated, often per plot, to ensure the right case and population counts are matched against each other.
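A minimal sketch of such a per-plot incidence calculation in base R. The column names `pop_year`, `region`, `age`, and `case_count` follow the structure shown in 3.1; the population table, its column `population`, and all numbers are hypothetical.

```r
# Hypothetical case counts, already summed over the weeks of each year.
cases <- data.frame(
  pop_year = c(2000, 2000, 2002),
  region = c("n", "e", "n"),
  age = c(0, 0, 0),
  case_count = c(5, 2, 8)
)

# Hypothetical population: one row per year, region, and age (made-up numbers).
population <- data.frame(
  pop_year = c(2000, 2000, 2002),
  region = c("n", "e", "n"),
  age = c(0, 0, 0),
  population = c(50000, 40000, 40000)
)

# Aggregate cases and population separately to the level shown in the plot
# (here: per year), so that the population is counted once per group
# and not once per case row.
case_sum <- aggregate(case_count ~ pop_year, data = cases, FUN = sum)
pop_sum <- aggregate(population ~ pop_year, data = population, FUN = sum)

# Match the two by year and compute cases per 100,000.
by_year <- merge(case_sum, pop_sum, by = "pop_year")
by_year$incidence <- by_year$case_count / by_year$population * 1e5
```

Dividing only after both sides have been aggregated is what keeps numerator and denominator consistent. Merging the population onto weekly case rows first and summing afterwards would count the same population once per week.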

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

To give an example, the age was included in the form “6 year olds” or “over 90 year olds” in the population data. I had to extract the numeric value and, because the disease data grouped everyone over 80 together, had to sum all ages above 80. Additionally, I had to sum the values for the various states of Germany to obtain population numbers for the regions.
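The steps described above could look roughly like this in base R. The state names, the state-to-region lookup, the column names, and the numbers are placeholders; the age labels follow the two forms quoted above.

```r
# Hypothetical raw population table with textual age labels.
pop_raw <- data.frame(
  state = c("State A", "State A", "State A", "State B", "State B"),
  age_label = c("6 year olds", "85 year olds", "over 90 year olds",
                "6 year olds", "over 90 year olds"),
  population = c(30000, 12000, 9000, 35000, 11000)
)

# Hypothetical lookup from state to region.
region_of <- c("State A" = "e", "State B" = "e")

# Keep only the digits of the label, then cap the age at 80 so that it
# matches the disease data, which groups everyone aged 80 and over.
pop_raw$age <- as.integer(gsub("[^0-9]", "", pop_raw$age_label))
pop_raw$age <- pmin(pop_raw$age, 80L)

# Map states to regions and sum the population per region and age.
pop_raw$region <- unname(region_of[as.character(pop_raw$state)])
pop <- aggregate(population ~ region + age, data = pop_raw, FUN = sum)
```

Capping with `pmin` before aggregating lets a single `aggregate` call handle both the summing over states and the pooling of all ages from 80 upward.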

In terms of unusual distributions, when I look at the quarterly incidences, there is a spike in EHEC cases, and E. coli drops to 0 in 2015.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Season of the year and age are the major factors when predicting the disease of a patient with gastrointestinal symptoms. I could not find an explanation for the generally higher incidences of one region, the East.

In a side exploration with incidence from another source it became clear that the aspect of reporting behavior is important, but beyond the scope of the data set.

What was the strongest relationship you found?

I think the time of the year has the strongest influence on incidence, but I am not sure if I can quantify that.

Were there any interesting or surprising interactions between features?

I could not find a relationship between age and season or region and season for the diseases. I decided to limit further exploration to years from 2008 on, because of a change in incidences I had not noticed before.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I forecast incidences for some diseases and age groups, and the forecasts were surprisingly accurate. Nevertheless, the forecast is based on the assumption of continuity, which limits the ability to foresee unusual events.


Final Plots and Summary

Plot One

Description One

The plot shows the mean annual incidence of the top five diseases by age of the patient. The gastrointestinal symptoms of a middle-aged patient are more likely to be caused by campylobacter than rotavirus.

I chose this plot because it puts a main finding of my exploration to use in an intuitive way. For my initial question, which disease a patient is most likely to have, this plot allows an easy lookup by comparing dot sizes.

Plot Two

Description Two

These three plots show the incidence of four diseases during the course of the year for different age groups. The symptoms of a middle-aged patient are more likely to be caused by norovirus in winter and campylobacter in summer.

I chose this plot because, on the one hand, it illustrates the finding that the season of the year is a major influence on disease incidence. On the other hand, it refines the insight that age has an important impact, too, and is a reminder that looking at the calendar alone is not sufficient.

Plot Three

Description Three

The two plots above show forecasts for the incidence of norovirus for two age groups. The actual incidence (red line) stays mostly within a 50% confidence interval (grey ribbon) of the forecast.

I chose this plot because it shows that a quite accurate prediction is possible if age and seasonality are taken into account.


Reflection

I had to start over twice before I was able to choose a good selection of data from the source. Contrary to my initial belief, I had to calculate the incidence on my own, using a supplementary dataset of population data. I had not expected the time variable to be so versatile: it also carries the season and various degrees of granularity.

The libraries I found most useful for analyzing time series are lubridate for date conversions, forecast for modelling, changepoint for detecting changepoints, and ggfortify for plotting.

During my analysis, the variables age and season of the year emerged as major factors. I could not explain the higher incidences in region East. Further, when contrasting with data from another source, it became clear that the way the data is gathered by the Robert Koch Institute limits how well it describes reality.

Later in my exploration, I identified changepoints in the data which helped to limit my data to relevant years. I could not find a relationship between age and season or region and season, but I was able to forecast some diseases rather accurately.

If I were able to access the data source without having to download separate sheets manually I could try to get insights from finer geographical information. Data on county level could be plotted on a map of Germany and the change of incidences over the year could be animated. It would be interesting to see if there are pockets where some diseases stay all year or if they spread from airports to the countryside.